The trouble with DBpedia
poor object resolution defeats the point of the Semantic Web
Generic Databases
DBpedia is an example of what I call a "generic database", that is, a database that models concepts that are in people's shared consciousness. The schema.org vocabulary covers this space; some other generic databases are Freebase and Wikidata.
DBpedia is popular because of its support for RDF standards, simple naming conventions, and straightforward vocabulary. Unfortunately, people who use DBpedia often encounter problems with data quality that derail their projects. These issues motivated us to build :BaseKB, a pure RDF product based on Freebase, to create a source of high-quality data that works in conjunction with DBpedia.
Will the real topic please stand up?
DBpedia derives the dbpprop:country property from Wikipedia infoboxes. If you were looking to find things located in the U.S., you'd be tempted to write a query like
select ?s {
?s dbpprop:country dbpedia:United_States .
}
try it on the DBpedia SPARQL endpoint
At first glance this seems to work; after all, we get a lot of results. (Note that many of these results are for movies set in the U.S., but that's all right, because we can filter on type.)
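For instance, a minimal sketch of that type filter, assuming dbpedia-owl:PopulatedPlace is the class we want to keep, might look like:
# keep only resources typed as populated places in the DBpedia ontology
select ?s {
  ?s dbpprop:country dbpedia:United_States ;
     a dbpedia-owl:PopulatedPlace .
}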
If you write more complex queries (say, to find the ten largest cities in the U.S., or the longest river wholly contained in Russia) you get answers too, but frequently the wrong ones. We can get some insight into this by running the following query:
select ?c (COUNT(*) AS ?cnt) {
?s dbpprop:country ?c .
} GROUP BY ?c ORDER BY DESC(?cnt) LIMIT 30
try it on the DBpedia SPARQL endpoint
Here we're using the GROUP BY operator in SPARQL 1.1 to look at the 30 most common objects of dbpprop:country, and we get the following result:
The #1 value is the English-language string "United States"@en and the #2 is the link to :United_States. Further down, we also find "USA"@en and "US"@en.
Just in the top 30 values we see four different identifiers for one concept, and we can easily find more if we look for objects that begin with the letter U:
select ?c (COUNT(*) AS ?cnt) {
?s dbpprop:country ?c .
FILTER(REGEX(STR(?c),"^u","i"))
} GROUP BY ?c ORDER BY DESC(?cnt) LIMIT 30
try it on the DBpedia SPARQL endpoint
and we get
Although the top four resources surely cover a large majority of occurrences, we see there are still a number of variant forms such as "U.S."@en and "U.S.A."@en, as well as case variations such as "United states"@en.
What's just as problematic is that we get some non-countries such as "Utah"@en as well as composite countries such as "United States and Canada", which really ought to be represented as two facts, i.e.
?s dbpprop:country :United_States .
?s dbpprop:country :Canada .
If you look hard enough you even find 27 topics connected with "America"@en, which is really awful.
Going to the root cause
The root cause of this is that DBpedia is doing exactly what it is supposed to be doing, that is, reading the contents of Wikipedia infoboxes. We can find the relevant Wikipedia pages by running queries like
select ?s {
?s dbpprop:country "United States"@en
} LIMIT 10
and we see pairs of example pages; the difference is that one page has a link to the United States Wikipedia page and the other one doesn't.
In terms of its mission, DBpedia is doing all the right things. The community can edit mappings from infoboxes to types and properties, and DBpedia Live is updated in real time. DBpedia is as successful as it is because it is built on the success of Wikipedia, but ultimately, Wikipedia is an encyclopedia for humans, not for computers.
It's hard to fix DBpedia because to fix facts in DBpedia, we have to fix them in Wikipedia, and to fix them in Wikipedia, we have to edit poorly documented and often inscrutable markup. Any effort to edit Wikipedia in bulk to clean up DBpedia runs the risk of (i) causing more damage than it's worth, (ii) getting undone over time by ordinary Wikipedia editors, and (iii) running afoul of the Wikipedia administrators. On top of all that, the mainstream DBpedia is updated on a slow timescale (DBpedia 3.9 came out in September 2013, and it is July 2014 as I write this), so you'd have to wait a very long time for any fixes to show up.
Cleanup
I read my first books on data mining back in the early 1990s, and one thing I read was that "80% of the effort in a data mining project goes into data cleaning."
I didn't completely believe it at the time, but I did when I was about 20% of the way through a data mining project.
There are two obvious ways to clean up a data set like DBpedia that correspond to the forward-chaining and backward-chaining styles of inference.
One strategy is to apply a cleaning process before we run queries: we clean up the data, load it into a database, and then query it. We'd deal with the multiple names by rewriting "United States"@en and all the other variants to :United_States.
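A minimal sketch of that rewrite as a SPARQL 1.1 Update, run against a local copy of the data rather than the public endpoint, and assuming the dbpprop: and dbpedia: prefixes are bound as usual (the variant list is illustrative, not exhaustive):
DELETE { ?s dbpprop:country ?c }
INSERT { ?s dbpprop:country dbpedia:United_States }
WHERE {
  ?s dbpprop:country ?c .
  # literal spellings that should all resolve to the one country resource
  FILTER(?c IN ("United States"@en, "USA"@en, "US"@en, "U.S."@en, "U.S.A."@en))
}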
The advantage of this strategy is that it is done in one phase, so we can draw a circle around the cleaned product, do quality checks, and sign off on it. If the work is really "done", then query writing is easy in the future.
An alternate strategy is to rewrite the queries and post-process the results to implement cleanup. For instance, we could rewrite the query to search for variant forms:
select ?s {
?s dbpprop:country ?c .
FILTER(?c IN (:United_States, "United States"@en, "USA"@en, ...))
}
It can be straightforward to do this on an automated basis because many SPARQL toolkits, such as Jena, provide APIs for manipulating queries programmatically.
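For instance, a query-rewriting tool could just as easily inject the known variants as a SPARQL 1.1 VALUES block instead of a FILTER (the variant list here is only a sketch, not exhaustive):
select ?s {
  # the rewriter substitutes every known spelling of the country
  VALUES ?c { dbpedia:United_States "United States"@en "USA"@en "US"@en }
  ?s dbpprop:country ?c .
}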
Invalid results can also be filtered post-query.
For instance, if you ever plot DBpedia coordinates on a map, you'll see a mirror image of Europe reflected across the Greenwich meridian because people often get the sign wrong in coordinates. These obviously bad coordinates could be removed or corrected by checking location types that only appear on land (such as cities) against a map -- and this kind of verification could be done either during or after query time.
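As a sketch of one such check, assuming the W3C WGS84 geo: vocabulary that DBpedia uses for coordinates and the dbpedia-owl ontology classes, here is a query for one obviously impossible case: German settlements with negative longitude (Germany lies entirely east of the Greenwich meridian):
select ?s ?long {
  ?s a dbpedia-owl:Settlement ;
     dbpedia-owl:country dbpedia:Germany ;
     geo:long ?long .
  # any negative longitude here is almost certainly a sign error
  FILTER(?long < 0)
}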
If the dream of large-scale linked data systems that incorporate data in real time from multiple messy sources is to come true, we'll need advances in querying over dirty databases. Until we find a way to beat the 80% figure above, however, data cleaning of Linked Data databases can be a project several times larger than the projects that people want to do -- which means you can blow through your schedule and budget long before you've started "real" development.
Getting it right from the beginning
An alternative strategy is to build a database around a different workflow. Instead of a 1-1 translation of Wikipedia to data, Freebase and Wikidata create a feedback loop in which data users can fix errors and omissions directly, through either a web interface or an API.
Freebase and :BaseKB
The most mature generic database today is Freebase. Freebase was initially available only through a proprietary query language and a proprietary data dump, but my research in 2011 demonstrated that Freebase quads could be translated to industry-standard RDF triples.
Although Freebase switched to the RDF format a few months later, the current Freebase dump is problematic, largely because the dump is not, itself, part of a feedback loop that guarantees its correctness.
:BaseKB is a purified version of Freebase which is compatible with industry-standard RDF tools. Because hundreds of millions of duplicate, invalid, or unnecessary facts have been removed, :BaseKB users speed up their development cycles dramatically compared to working with the source Freebase dumps. Our new RDFeasy products (Compact and Complete) combine purified data with carefully matched hardware and software to support fast and unlimited SPARQL 1.1 queries with a single-click install process.
Wikidata
Wikidata, in some sense, turns DBpedia around: Wikidata is intended to be a data source that populates Wikipedia infoboxes, rather than the other way around.
As of now, Wikidata is considerably smaller than either DBpedia or Freebase, containing (as of Feb 2014) only about 30 million facts exclusive of labels, descriptions and aliases.
The state of the art in Wikidata usage is a bit behind Freebase and DBpedia. There is an API and a toolkit that can handle the raw JSON data in Wikidata. RDF dumps are now available, but current products are unsatisfactory because Wikidata's data model is a bit richer than RDF. Wikidata not only records facts, it also records information about the attribution of those facts -- this is awkward because simply omitting qualified statements causes data loss, while accepting all qualified statements means accepting incorrect information (for example, a population figure qualified with a date far in the past). It is possible to express statements with references and qualifiers in RDF, but then it becomes difficult to write SPARQL queries.
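As a rough illustration of why this complicates queries (using an entirely hypothetical ex: vocabulary, not the layout of any actual Wikidata dump), a qualified fact stops being a single triple and becomes a small subgraph that every query has to navigate:
# hypothetical ex: vocabulary, for illustration only
select ?city ?population ?asOf {
  ?stmt ex:subject ?city ;
        ex:property ex:population ;
        ex:value ?population ;
        ex:qualifier [ ex:property ex:pointInTime ; ex:value ?asOf ] .
}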
Wikidata is a promising data source in the long term, but those who need a generic database now are best off starting with :BaseKB.
Ontology of the un-Ontologizable
Although DBpedia's coverage of topics that are easy to ontologize, such as people, places, and creative works, will probably always be worse than Freebase's, DBpedia has some unique properties that are particularly useful for topics that aren't easy to fit into a box.
For one thing, DBpedia captures hyperlinks between Wikipedia topics. Although this only captures the fact that "A relationship exists between ?A and ?B", the sheer volume of this data makes it appropriate for statistical use.
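A minimal sketch of tapping that link graph, assuming the page-links dataset is loaded under the dbpedia-owl:wikiPageWikiLink property:
select ?linked {
  # pages that the Semantic Web article links to
  dbpedia:Semantic_Web dbpedia-owl:wikiPageWikiLink ?linked .
} LIMIT 20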
Another valuable, if imperfect, data set is the Wikipedia categories. Critics would point to the categories listed at the Wikipedia page for a well-known polymath to show that many of these are ridiculous to maintain by hand, such as "1947 births", "20th-century American male actors", "20th-century Austrian male actors", "People from Graz", etc. Many of these categories are unions of other categories, from which a large number of permutations can be generated. (But don't go too far: his ignominious military career, which ended when he went AWOL to attend a bodybuilding competition, qualifies him as an "Austrian soldier" but not an "American soldier".)
The Wikidata people would like to replace these categories in Wikipedia with categories generated from the properties, and in the case of Mr. Schwarzenegger, that's a good idea. Let's hope this doesn't happen for more abstract concepts.
One can find many faults in the categories as they exist (for instance, why is there a category for M-estimators but no category for L-estimators?). Although we could certainly ontologize some topics in depth (for instance, L-estimators such as the mid-range and the trimean share properties such as the breakdown point and efficiency), we can't fit every topic into a deep ontology, and the categories let us go into places where we otherwise couldn't.
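As a sketch, DBpedia attaches Wikipedia categories to resources via dcterms:subject (assuming that prefix is bound to Dublin Core terms as it is on the DBpedia endpoint), so the trimean's categories can be pulled with a query like:
select ?category {
  # Wikipedia categories for the trimean article
  dbpedia:Trimean dcterms:subject ?category .
}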